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1.  Introduction. 


This  paper  is  concerned  with  the  efficient  simulation  of  certain  steady-state  quantities  in  models 
of  highly  dependable  computing  systems.  In  particular,  we  will  consider  techniques  for  accurately 
estimating  the  steady-state  unavailability,  U ,  for  models  in  which  the  failure  and  repair  time 
distributions  are  generally  distributed.  Because  the  system  being  modeled  is  assumed  to  be  highly 
dependable,  system  failure  events  are  rare  and  therefore  U  ~  0.  Standard  simulation  of  such 
systems  require  enormous  sample  sizes  in  order  to  accurately  estimate  U\  typically,  the  closer  U  is 
to  zero,  the  longer  a  standard  simulation  needs  to  be  run. 

We  seek  to  avoid  such  long  run  lengths  by  using  the  technique  of  importance  sampling  [11],  [14]. 
In  importance  sampling,  the  system  is  simulated  using  a  new  set  of  input  distributions  (e.g.,  failure 
distributions)  that  are  chosen  in  such  a  way  as  to  make  the  rare  event  much  more  likely  to  occur. 
An  unbiased  estimate  is  then  obtained  by  multiplying  the  output  of  the  simulation  experiment  by 
the  likelihood  ratio.  If  the  new  method  of  sampling  is  chosen  properly,  then  the  variance  of  the  new 
estimator  will  be  much  less  than  the  variance  of  the  standard  estimator.  Importance  sampling  has 
been  effectively  employed  in  a  variety  of  situations,  including  queueing  models  (see,  e.g.,  [9], [25] 
and  [26]).  Another  approach  to  variance  reduction  when  estimating  long-run  averages  in  models 
of  communication  networks  is  considered  in  [21], 

Its  use  in  simulating  highly  dependable  systems  of  the  type  described  in  [12]  has  been  studied 
in  [3],  [4],  [13],  [15],  [16],  [17],  [19],  [20],  [22],  [23],  [24],  [28]  and  [29].  A  number  of  these  references 
prove  that  when  the  importance  sampling  distribution  is  chosen  according  to  a  certain  heuristic, 
then  the  resulting  estimator  satisfies  the  “bounded  relative  error”  property.  For  example,  if  U  is  an 
estimator  of  U,  then  U  has  bounded  relative  error  if  Standard  Deviation[(/]/f7  remains  bounded 
even  as  U  — *  0.  In  practice,  bounded  relative  error  implies  that  only  a  fixed  number  of  samples 
are  required  to  accurately  estimate  U ,  no  matter  how  small  U  is,  i.e.,  no  matter  how  rare  system 
failure  events  become.  The  above  papers  have  considered  two  distinct  situations: 

1.  Estimation  of  steady-state  quantities  such  as  the  long  run  unavailability  U. 

2.  Estimation  of  transient  quantities  such  as  the  reliability  R(t)  which  is  defined  as  the  probability 
that  the  system  does  not  fail  before  some  fixed  time  t. 

Results  on  steady-state  estimation  have  basically  been  restricted  to  cases  in  which  the  compo¬ 
nent  failure  time  distributions  Eire  exponentially  distributed.  This  restriction  is  required  because 
the  steady-state  estimators  exploit  the  regenerative  structure  of  such  models  (see,  e.g.,  [6]).  In  this 
case,  because  of  the  exponential  assumption,  regenerations  occur  whenever  the  model  enters  the 
state  in  which  all  components  are  operational.  This  permits  a  ratio  representation  of  steady-state 
quantities,  e.g.,  U  =  E[Di]IE[Ci\  where  D{  is  the  toted  amount  of  downtime  during  the  ith  regen¬ 
erative  cycle  and  Cj  is  the  length  of  the  ith  regenerative  ey^le.  (The  time  between  regenerations 
is  called  a  cycle.)  In  addition,  different  regenerative  cycles  axe  i.i.d.  (independent  and  identically 
distributed)  thereby  permitting  straightforward  variance  estimation  and  formation  of  confidence 
intervals. 
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now,  these  techniques  have  not  been  successfully  applied  to  estimating  steady-state  quantities  due 
to  the  fact  that  an  entrance  to  the  “all  components  up’’  state  no  longer  constitutes  a  regeneration. 
In  this  paper,  we  show  how  these  techniques  can  be  extended  to  estimating  steady-state  measures 
in  systems  without  regenerative  structure.  The  approach  makes  use  of  a  ratio  representation  for 
steady-state  quantities  in  terms  of  “.4-cycles”  that  is  similar  to  that  obtained  in  a  regenerative  set¬ 
ting,  but  which  is  more  generally  applicable.  Here,  a  new  .4-cycle  is  defined  to  start-  whenever  the 
process  enters  some  set  of  states  A.  (In  our  setting,  A  =  all  components  operational.)  However,  the 
4,-cycles  are  no  longer  i.i.d.,  which  somewhat  complicates  the  use  of  importance  sampling,  variance 
estimation,  and  the  formation  of  confidence  intervals.  Approaches  for  effectively  dealing  with  these 
problems  will  be  described  in  the  paper.  In  particular,  we  employ  a  “splitting”  technique  which 
uses  importance  sampling  to  estimate  the  expected  downtime  during  an  A-cycle  and  uses  standard 
simulation  to  estimate  the  expected  length  of  an  A-cycle. 

While  we  have  not  (yet)  proven  that  this  approach  gives  rise  to  estimates  with  bounded 
relative  error,  the  method  is  shown  to  be  highly  effective  in  practice.  In  addition,  if  the  failure 
and  repair  distributions  are  exponential,  then  the  approach  is  closely  related  to  using  “balanced 
failure  biasing” [28]  and  “measure  specific  dynamic  importance  sampling”  (MSDIS)  [13]  which, 
for  Markovian  models,  was  shown  to  have  bounded  relative  error  for  estimating  the  steady-state 
unavailability  in  [28].  Also,  when  the  failure  and  repair  times  are  generally  distributed,  then  the 
importance  sampling  heuristic  used  is  known  to  produce  an  estimate,  of  the  transient  reliability 
R(t),  having  bounded  relative  error  [16].  For  these  reasons,  we  conjecture  that  (under  appropriate 
technical  conditions)  the  method  does  produce  estimates  of  the  steady-state  unavailability  with 
bounded  relative  error  when  failure  and  repair  times  are  generally  distributed. 

The  rest  of  the  paper  is  organized  as  follows.  In  Section  2  the  ratio  representation  is  discussed. 
The  issues  of  how  to  effectively  combine  importance  sampling  with  this  ratio  representation  for 
general  rare  event  estimation,  and  how  to  estimate  the  variance  are  also  described  in  Section  2. 
The  application  of  these  results  to  estimating  the  steady-state  unavailability  and  our  particular 
importance  sampling  heuristic  are  described  in  Section  3.  Experimental  results  are  presented  in 
Section  4,  and  the  results  axe  summarized  in  Section  5. 


2  Estimation  of  Steady-State  Measures  in  Non-Regenerative  Systems 

Consider  systems  that  can  be  modelled  as  Generalized  Semi-Markov  Processes  (GSMP’s).  For 
a  detailed  exposition  on  GSMP’s  the  reader  is  referred  to  [10]  (see  also  [23]  for  an  alternative 
description).  Roughly  speaking,  these  are  systems  which  axe  characterized  by  an  output  state 
vector,  say  X (f),  that  takes  value  in  Z1,  and  an  internal  state  vector,  say  S(t),  that  takes  values 
in  Rm  (Z  =  set  of  integers,  l  is  a  positive  integer  and  m  is  a  non-negative  integer).  The  choice 
of  the  output  state  vector  depends  on  the  application  at  hand  and  the  desired  level  of  detail. 
The  S(t)  is  defined  such  that  it  has  enough  information  about  the  history  of  the  system,  so  that 
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{(Jf(s),  S(s))  :  s  >  t)  depends  on  {(.Y(s),  5(s))  :  0  <  s  <  t}  only  through  (A'(f),  S(f)).  Let 
/(.Y(f))  be  some  bounded  real  valued  function  on  the  output  state  space  <5.  For  example,  consider 
a  system  with  N  different  components.  Each  component  has  generally  distributed  failure  and 
repair  times.  The  age  of  an  operational  component  is  the  time  since  it  is  last  became  operational. 
The  system  has  many  repairmen  (each  component  is  assigned  a  particular  repairman)  which  repair 
components  using  the  FCFS  service  discipline,  and  components  interact  by  sharing  repairmen.  In 
this  case  we  may  define  Ji 1(f)  to  be  an  ^-dimensional  vector,  the  ith  element  of  which  is  1  if  the 
t’th  component  is  up  and  0  otherwise.  The  internal  state  vector  S(t)  may  be  defined  to  include  the 
ages  of  the  components  that  are  operational,  the  repair  queue  and  the  elapsed  repair  time  (if  any) 
at  each  repairman.  For  the  purposes  of  availability  estimation,  the  function  f(X(t))  may  be  given 
a  value  1  if  in  the  state  X(t)  enough  components  are  operational  for  the  system  to  be  considered 
up;  otherwise,  it  is  given  a  value  of  0. 

Under  fairly  general  conditions  that  ensure  ergodicity  (which  are  analogous  to  recurrence  type 
properties  in  Markov  chains  with  countable  state  space),  as  t  — *  oo,  the  quantity  f*  f(X(s))ds/t 
converges  to  a  constant  with  probability  1.  Let  us  denote  this  constant  by  U.  Our  goal  is  to 
estimate  this  steady-state  measure  U. 

2.1  Standard  Estimation  Procedures  for  Steady-State  Measures 

Let  A  denote  a  length  of  time.  Define  D,  =  /(-Y(s))ds/A  and  form  the  estimate  U  = 

J3"=1  Di/n.  Then  by  definition,  as  n  — *  oo,  17  —*  U  with  probability  1.  Also,  under  some  more 
regularity  conditions,  we  have  the  central  limit  theorem  (CLT):  y/n(U  —  U)=>  N(0,  cr2),  as  n  — >  oo, 
where  cr2  is  a  variance  constant.  In  that  case,  if  we  knew  cr2,  we  could  construct  a  (1—  6)%  confidence 
interval  (Cl)  for  U.  The  half  width  (HW)  of  this  Cl  will  be  given  by  z&/2a /y/ii,  where  z6/2  is  the 
100(1  —  S/2 )  percentile  point  of  the  standard  normal  distribution  (see  [1]).  If  the  Di's  were  i.i.d., 
then  <r2  =  Var(Di)  which  would  be  easy  to  estimate.  If  we  discard  some  of  the  initial  D/s  (i.e., 
allow  the  system  to  reach  steady-state),  then  the  Di's  from  then  on  are  (approximately)  identically 
distributed  but  still  not  independent.  Hence  the  method  of  batch  means  (see  [1])  can  be  used  to 
estimate  cr2.  We  now  briefly  review  this  procedure. 

In  the  method  of  batch  means  we  group  the  D/s  into  batches,  each  batch  having  k  successive 
Di's,  i.e.,  if  b  is  the  number  of  batches,  then  kb  =  n.  Let  Sj  =  22i=(j-i)k+i  Di/k  for  1  <  j  <  b. 
Then  we  form  the  estimate  U  =  =  $Z"=1  Di/n  which  is  the  same  estimator  as  before. 

The  method  of  batch  means  make  use  of  the  assumption  that  for  sufficiently  large  k  the  6/s  are 
(approximately)  normally  distributed  and  uncorrelated.  If  we  discard  the  first  few  6j  s  to  allow 
for  the  system  to  reach  steady-state,  then  the  Sj ’s  are  also  identically  distributed.  In  that  case  we 
again  have  the  CLT,  where  now  k  — ►  oo.  If,  in  addition,  b  is  large,  then  a2  =  Var(Sj)  can  easily 
be  estimated. 

Consider  f(X(s))  of  the  form  l{x(»)6^}  (the  indicator  function),  where  X  C  S  is  a  rare  set  of 
states  (but  with  non-zero  probability).  In  this  case,  most  of  the  Di's  and  thus  the  Sj' s  will  be  zero, 
and  therefore,  as  mentioned  in  the  introduction,  it  is  hard  to  get  accurate  estimates.  So  techniques 
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like  importance  sampling  have  to  be  used.  In  the  following  sections  we  develop  a  method  based 
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techniques  of  batch  means,  splitting  and  importance  sampling  to  efficiently  estimate  the  ratio. 


2.2  A  Ratio  Representation  of  Steady-State  Measures 


Let  A  C  $■  As  in  the  introduction,  an  4-cycle  is  defined  to  start  whenever  the  {X(t)  :  t  >  0} 
enters  A.  Let  C  be  the  length  of  an  4-cycle.  Let  q>  denote  the  probability  dynamics  governing  the 
realizations  of  {(A'(<),  S(t))  :  t  >  0}.  Let  7 r  be  the  steady-state  distribution  of  ( X(t),S(t ))  at  the 
times  when  (A'(t)  :  t  >  0}  enters  A.  Then  under  fairly  general  recurrence  type  conditions  (which 
also  ensure  that  the  system  returns  to  state  A  infinitely  often)  we  have  that  (see  [5]  and  Section 
6.9  of  [2]  for  details;  see  also  [7]  ) 

En.<t>{D) 


U  = 


EnACY 


(1) 


where  now  D  =  f^_Q  l{.v(«)g^}ds  and  the  subscripts  in  the  expectation  denote  the  initial  distribu¬ 
tion  and  the  probability  dynamics  governing  the  realizations  of  {(A’(t),  S(t))  ■  t  >  0}. 

To  make  use  of  this  ratio,  we  can  first  run  the  process  for  some  time  so  that  it  reaches 
steady-state  and  then  run  n  .4-cycles  to  get  n  samples  of  D  and  C.  However,  as  in  the  standard 
estimation  procedure  described  above,  there  are  two  problems.  First,  because  the  set  T  is  rare, 
most  of  the  samples  of  D  will  be  zero,  leading  to  an  inaccurate  estimate  of  E(D).  To  overcome 
this  we  use  importance  sampling.  Second,  the  samples  of  D  (and  C)  are  identically  distributed 
but  not  independent.  This  can  be  handled  using  the  method  of  batch  means. 


2.3  Efficient  Simulation  of  Rare  Events  Using  Importance  Sampling 

Let  <p'  denote  a  new  probability  dynamics  that  now  governs  the  realizations  of  {(A" [t),  S{t))  :  t  >  0} , 
such  that  the  probability  on  the  sample  paths  of  {(A(<),  S(f))  :  t  >  0}  using  <p  is  absolutely  contin¬ 
uous  with  respect  to  the  probability  on  the  sample  paths  using  <p'  (absolutely  continuous  essentially 
means  that  any  event  that  had  a  positive  probability  of  occurrence  under  the  original  dynamics  also 
has  a  positive  probability  of  occurrence  under  the  new  dynamics).  When  importance  sampling  is 
used,  we  can  write  E„^{D)  =  E„^i(DL)  (the  second  subscript  in  the  second  expectation  indicates 
that  we  are  using  the  new  probability  dynamics  <p'  to  generate  D  and  L),  where  L  is  the  likelihood 
ratio.  The  main  problem  in  importance  sampling  is  to  choose  an  easily  implementable  <p'  so  that 
Var^^^DL)  «  Varn^(D).  Then  for  estimating  the  ratio  we  can  write  Equation  1  as 

E„,y(DL)  px 

E^iC) 

and  use  the  following  simulation  scheme.  We  first  run  a  few  A-cycles  using  q>  so  that  the  system 
enters  steady-state  and  we  are  assured  of  (A'(f),  S(t))  being  distributed  sufficiently  close  to  n  at 
the  start  of  .4-cycles.  Then  we  do  a  splitting  technique  (see  [14])  in  each  of  the  4-cycles,  where  we 
do  one  run  using  the  dynamics  <p'  to  get  samples  of  D  and  L  and  a  second  run  with  the  original 
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dynamics  <p  to  get  a  sample  of  C.  The  second  run  also  ensures  that  we  again  get  the  distribution 
rr  when  the  system  re-enters  state.. 4.  so  that  we  can  start  another  .4-cycle  at  that  point  in  time. 
We  repeat  this  procedure  to  get  the  samples  D,,  Li  and  C',,  1  <  i  <  n,  of  D,  L  and  C,  respectively. 
This  approach  is  very  analogous  to  “measure  specific  dynamic  importance  sampling”  (MSDIS)  [13] 
for  estimating  the  steady-state  unavailability  in  Markovian  systems;  importance  sampling  is  used 
to  estimate  the  expected  downtime  in  a  cycle  and  standard  simulation  is  used  to  estimate  the 
expected  cycle  time. 

2.4  Variance  Estimation  Using  the  Method  of  Batch  Means 

Since  the  .4-cycles  axe  not  independent,  we  use  the  method  of  batch  means  to  estimate  the  variance. 
As  before  let  b  be  the  number  of  batches  and  let  k  be  the  batch  size.  Let  6j  =  DiLi/k 

and  7j  =  J2i=(j-i)k+i  Ci/k.  The  we  form  the  estimate 
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where  6  =  6j/b  =  £"=1  DiLi/n,  and  7  =  £‘=1  7>/6  =  £"=1  Ci/n. 

In  the  steady-state,  for  sufficiently  large  t,  the  Sj's  (and  7/s)  are  micorrelated.  We  introduce 
the  generic  random  variables  6  and  7  having  the  same  distributions  as  6}  and  7,,  respectively. 
It  follows  that  E(6 )  =  En.^‘(8)  and  E( 7)  =  En.<t>i~t)-  Also,  Var(6)  =  Varn^-(6)/b,  Var( 7)  = 
Var„  ^(j)/b  and  Cov(6, 7)  =  Cov*  #  The  subscripts  in  the  covariance  term  indicate  that 

for  each  i,  (D,,  Li)  (used  in  the  6/s)  and  C,  (used  in  the  7/s)  are  sampled  using  the  same  starting 
state  (i.e.,  distributed  according  to  7r),  and  using  <p'  and  <p,  respectively.  Finally,  from  the  CLT  we 
have  y/b(U  -  U)  «  N(Q,(J2)  for  large  k  and  b,  where  (analogous  to  the  regenerative  method) 

2  Var „,</,<  (5)  +  U2  x  Var„^{ 7)  -  2  x  U  x  Cov*.#  .4(6, 7)  ,  . 

' = - -  (4) 


3  Estimation  of  Unavailability  in  Highly  Dependable  Systems 

The  class  of  highly  dependable  systems  considered  in  this  paper  is  that  composed  of  highly  reliable 
components,  i.e.,  the  mean  time  between  failures  (MTBF)  of  its  components  is  orders  of  magnitude 
larger  than  their  repair  time.  High  dependability  cm  also  be  achieved  by  increasing  the  redundancy 
level  of  less  reliable  components;  however,  here  we  are  not  concerned  with  this  type  of  systems. 

Without  loss  of  generality,  we  consider  models  of  highly  dependable  systems  in  which  a  com¬ 
ponent  may  be  in  one  of  only  two  states:  operational  and  failed.  When  a  component  fails  in  a  given 
mode,  it  may  cause  other  components  to  fail  with  some  probability  (failure  propagation).  Different 
sets  of  components  may  be  affected  at  different  failure  modes.  Failed  components  me  repaired  by 
one  or  more  repair  facilities  according  to  some  arbitrary  service  discipline.  Basically,  these  models 
me  similm  to  those  that  can  be  handled  using  the  SAVE  package  [12].  However,  unlike  SAVE,  we 
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allow  general  distributions  for  failure  and  repair  times,  and  repairs  are  assumed  not  to  be  instan¬ 
taneous.  Let  Gi(x)  denote  the  failure  distribut 


IGii  of  v  viiipGliCiib 


is  given  by  /i,(x)  =  gi(x)/Gl(x),  where  g,{x)  is  the  probability  density  function  corresponding  to 
Gi(x)  and  G,(x)  =  1  —  G,(x).  We  further  parameterize  the  hazard  rate  function  in  terms  of  a 
small  (but  positive)  parameter  e  and  assume  it  is  bounded,  such  that  h{(x)  <  A.c6, ,  x  >  0,  where 
0  <  Aj  <  oo  and  6;  >  1.  As  will  be  further  discussed  in  Section  3.1,  the  assumption  of  bounded  fail¬ 
ure  hazard  rate  functions  (which  holds  for  many,  including  phase-type,  distributions)  is  necessary 
in  order  to  use  the  uniformization  approach.  Weibull  is  not  included  in  the  class  of  bounded  haz¬ 
ard  rate  distributions,  however,  it  can  be  arbitrarily  well  approximated  by  appropriately  bounding 
its  hazard  rate  function;  this  will  enable  us  to  experiment  with  an  increasing  failure  rate  Weibull 
distribution,  as  will  be  described  in  Section  4.3. 


3.1  A  Uniformization-Based  Importance  Sampling  Approach 

The  uniformization  technique  (also  known  as  randomization  [IS] )  can  be  used  to  sample  from  gen¬ 
eral  distributions  (e.g.,  nonhomogeneous  Poisson  processes)  with  bounded  hazard  rate  functions. 
Such  distributions  include  Markovian  phase-type,  but  exclude  discrete,  uniform  and  Weibull  dis¬ 
tributions.  To  illustrate,  consider  simulating  the  nonhomogeneous  Poisson  process  with  a  bounded 
hazard  rate  h(t)  <  0,  where  0  is  a  constant  rate.  We  generate  the  event  times  {Tfc},  k  =  1,2, ... , 
of  a  homogeneous  Poisson  process  at  rate  /?  (fi  is  called  the  uniformization  rate.)  T*  is  accepted 
as  an  actual  event  of  the  simulated  process  with  probability  /i(Tjt)//?  (real  event),  otherwise,  it  is 
rejected  (pseudo  event).  The  acceptance/rejection  test  is  performed  at  consecutive  uniformization 
events  until  an  event  is  accepted,  in  which  case  the  same  procedure  is  repeated  to  generate  the 
next  (real)  event  of  the  simulated  nonhomogeneous  Poisson  process. 

In  the  context  of  reliability  estimation,  uniformization  can  be  used  to  implement  importance 
sampling  in  Non-Markovian  models  as  described  in  [24].  In  a  similar  way,  it  can  also  be  used  for 
unavailability  estimation  as  we  will  describe  in  this  section. 

Consider  a  system  with  N  components.  At  any  time  t,  let  0{t)  be  the  set  of  operational 
components,  and  denote  by  a;(t),i  G  0(t),  the  age  of  component  i  (i.e.,  the  time  since  it  is  last 
became  operational).  Define  the  failure  rate  of  component  i  at  time  t  to  be  A,(t),  then 


f  if  i  €  0{t) 

\  0,  otherwise. 


The  total  failure  rate  at  time  t  is  given  by  Ar(f)  =  Ej,,  A;(<). 

Let  {r„},n  =  1,2,...,  correspond  to  the  real  event  times  in  the  system  (i.e.,  failure  and 
repair  events).  At  real  event  r„,  let  A f(t),t  >  r„,  be  bounded  by  a  constant  rate,  say,  (3n  (i.e., 
^f(0  <  >  Tn).  Then  the  time  to  next  failure  event  in  the  system  cam  be  sampled  by 

successively  generating  the  Poisson  (uniformization)  event  times  {Tn*},Jt  =  1,2,...,  at  rate  /?„, 
and  performing  the  acceptance/rejection  test  until  an  event  is  accepted.  Event  Tn*  is  rejected  with 
probability  1  —  Af(Tnj;)/ pn,  in  which  case  the  next  uniformization  event  is  generated.  Otherwise, 


Tnk  is  accepted  as  the  next  failure  event  time.  This  procedure  is  repeated  at  every  real  event 
(failure  or  repair)  to  generate  the  iime  to  next  failure  event.  In  the  original  system,  the  expected 
time  to  next  failure  event  (/)"*)  is  much  larger  than  the  expected  time  to  next  repair  event,  if  any. 
Therefore,  a  system  failure  is  very  unlikely  to  occur. 

As  we  simulate  the  system,  repairs  are  sampled  from  their  original  distributions,  which,  there¬ 
fore,  are  not  restricted  to  the  class  of  bounded  hazard  rate  functions.  In  particular,  this  allows 
arbitrary  repair  time  distributions,  including  general  discrete  and  uniform  distributions.  It  is  as¬ 
sumed  that  failure  propagation  probabilities  are  not  changed  in  the  simulated  system.  It  follows 
that  uniformization  (with  importance  sampling)  is  used  only  to  simulate  the  time  of  failure  events. 

To  implement  importance  sampling  using  uniformization  we  do  the  following.  At  real  event 

r„  (during  a  repair),  we  can  simply  increase  the  uniformization  rate,  /?„,  rind  fix  the  acceptance 

def 

probability  at  some  level,  say,  pn,  such  that  cvn  =  /3nPn  (this  is  the  effective  rate  at  which  the  next 
failure  event  is  generated;  we  call  it  the  “biasing  level'’)  is  of  the  same  order  as  the  repair  “rate'’. 
This  will  increase  the  probability  of  subsequent  failure  events  leading  to  a  system  failure. 

Upon  the  occurrence  of  a  failure  event  in  the  original  system,  say,  at  rn,  component  i  is 
selected  as  the  failed  component  with  probability  p„,  =  A;(Tn)/Af'(rn).  However,  with  importance 
sampling,  this  probability  could  be  changed.  For  example,  in  “balanced  failure  biasing’’  [28],  we 
equalize  the  failure  probability  for  all  operational  components.  In  this  case,  component  i  £  0(rn)  is 
selected  to  fail  with  probability  p„,  =  l/|0(r„)|.  In  addition  to  being  simple  and  robust,  “balanced 
failure  biasing”  is  known  to  be  provably  effective  in  the  context  of  unreliability  estimation  [16], 
[28],  [29], 

The  likelihood  ratio  is  computed  recursively,  by  updating  it  only  at  pseudo  and  failure  event 
times  as  follows.  Let  Lk,  k  =  0, 1, 2, . . . ,  be  the  likelihood  ratio  at  the  kth  (pseudo  or  failure)  event, 
at  time  t*,  then  Lq  =  1  and 


Lit  =  Lk-i  x 


(  » )  (  ^(tk)/XF(tk)  ) )  if  ^  •  failure  event 

<  if  pseudo  event. 


(5) 


In  other  words,  the  likelihood  ratio  is  updated  by  a  factor  equal  to  the  ratio  of  the  probability  of 
the  kth.  event  in  the  original  and  simulated  systems,  respectively.  Notice  in  the  above  equation 
that  /3k, pk  and  Pki  can  be  changed  at  pseudo  events;  however,  in  our  implementation  (as  described 
above)  they  are  changed  only  at  real  (failure  or  repair)  events.  Heuristics  for  choosing  /3k, Pk 
and  pki,  as  well  as  other  practical  considerations,  are  discussed  in  Section  3.2.  For  unreliability 
estimation  [16],  it  is  shown  that  under  reasonable  assumptions  and  appropriate  heuristics,  the  above 
method  is  provably  effective,  i.e.,  it  yields  estimates  with  bounded  relative  error.  In  Section  4  we 
experimentally  demonstrate  the  effectiveness  of  our  method  for  unavailability  estimation.  However, 
theoretical  results  to  establish  the  property  of  “bounded  relative  error”  are  not  yet  available. 


3.2  Implementation  Issues 

In  this  section  we  consider  specific  implementation  issues  in  the  estimation  of  steady-state  un¬ 
availability  using  uniformization  and  importance  sampling  as  discussed  in  the  previous  sections. 
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In  our  implementation  we  use  CSIM  [27],  a  process-oriented  simulation  language  based  on  the  C 
programming  language. 

Following  our  discussion  in  Section  2.3  as  applied  to  highly  dependable  systems,  we  define 
an  A-cycle  to  be  a  sample  path  between  two  successive  entries  into  the  fully  operational  state  (in 
which  all  system  components  are  operational).  Specifically,  in  our  context,  the  set  A  constitutes 
all  possible  components’  ages  upon  entering  the  fully  operational  state.  (Notice  that  upon  entering 
A  at  least  one  component  has  an  age  identical  to  zero). 

The  ratio  representation  of  the  steady-state  system  unavailability  U  is  given  by  Equation  2, 
where  C  is  the  length  of  an  A-cycle  in  the  original  system,  D  is  the  total  downtime  in  an  /1-cycle 
in  the  simulated  system  (with  importance  sampling)  and  L  is  the  corresponding  likelihood  ratio. 
An  estimate  of  the  steady-state  unavailability  is  given  by  Equatio..  3,  where  6  and  7  cure  estimates 
of  E„^'(DL)  and  En. <(,{€),  respectively.  Recall  that  the  subscripts  tt  and  <p  (</>')  indicate  that 
the  expectation  is  taken  over  A-cycles  having  the  typical  steady-state  entry  distribution  n ,  with 
respect  to  the  original  (new)  probabilit’  dynamics. 

The  system  is  simulated  sufficiently  long,  under  the  original  probability  dynamics  <p,  until  it 
(approximately)  reaches  the  steady-state.  From  that  point,  let  n  be  the  number  of  A-cycles  used 
to  obtain  the  estimate  7.  Since,  in  general,  successive  .4-cycles  are  not  independent,  we  use  the 
method  of  batch  means  to  get  an  estimate  Var( 7)  of  the  variance  Far*, ^(7).  Following  the  same 
notation  as  in  Section  2.4,  let  b  be  the  number  of  batches,  each  having  k(  =  n/b)  .4-cycles.  Let  C, 
be  the  length  of  the  ith  .4-cycle,  and  for  batch  j,  let  7j  —  Ct/k.  Then  we  have 

bn  6 

7  =  YL  lj/b  =  C'/n  and  F«r(7)  =  -  7)2/(b  -  !)■ 

>= 1  >=1  i= 1 

k  should  be  sufficiently  large,  so  as  to  eliminate  dependence  between  successive  batches.  Our 
experimental  results  in  Section  4.1  indicate  that  k  need  not  be  large. 

To  obtain  the  estimate  b  we  do  the  following.  For  each  .4-cycle  that  we  simulate  under  the 
original  probability  dynamics  (we  call  this  an  ;‘originaT’  A-cycle),  we  simulate  the  same  A-cycle 
(i.e.,  starting  with  the  same  initial  components’  ages)  under  the  new  probability  dynamics,  i.e.,  with 
importance  sampling  (we  call  this  a  "biased"’  .4-cycle).  Usually  more  effort  is  needed  to  estimate 
E„.4'(DL)  than  that  to  estimate  En,^{C)  (since  there  are  typically  more  events  in  a  biased  cycle 
adn  L  must  be  computed).  Therefore,  we  run  several,  say,  m,  biased  A-cycles  for  each  original 
.A-cycle.  In  this  way,  we  use  more  cycles,  namely,  n'  =  mn,  to  obtain  the  estimate  6.  If  we  use  the 
same  number  of  batches,  6,  to  get  an  estimate  Var{b)  of  the  variance  Varn^>(6),  then  the  number 
of  biased  A-cycles  per  batch  is  equal  to  k'  =  mk.  Let  D{,  be  the  total  system  downtime  in  the 
sth  run  of  the  ith  biased  A-cycle,  and  Llt  is  the  corresponding  likelihood  ratio.  For  batch  j,  let 
=  F  Ei=(>-i)fc+i  £7=i  D»Li>-  follows  that 

b  _  n  m  6 

6  =  ]£  6j/b  =  —  ^  ^2  D„L„,  and  Var(6)  =  ^T(bj  -  6)2 /(b  -  1). 

7=1  n  i=i  .=  1  7=1 


Furthermore,  an  estimate  of  the  covariance  Covn ^>^(<5,  7)  is  given  by 
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Cov(S, 7)  =  ^(6,  -  6)  x  -7 )/(6  -  1). 
j=i 

An  estimate  a2  for  the  variance  a2  of  the  estimator  U  can  now  be  obtained  from  Equation  4. 

With  importance  sampling,  our  goal  is  to  increase  the  frequency  of  A-cycles  that  contain 
typical  system  failures.  Our  heuristic  is  similar  to  that  for  regenerative  models  in  [23].  In  a 
biased  A-cycle,  upon  the  occurrence  of  the  first  component  failure,  we  activate  failure  biasing  to 
accelerate  subsequent  component  failures  relative  to  the  current  repair  event.  Failure  biasing  is 
continued  until  either  system  failure  or  the  end  of  the  current  A-cycle.  In  doing  so,  we  increase  the 
probability  of  system  failure  in  biased  A-cycles.  More  specifically,  as  discussed  in  Section  3.1,  if 
failure  biasing  is  activated  at  the  nth  real  (failure  or  repair)  event,  th^n  the  uniformization  rate  /?„ 
and  the  acceptance  prooability  fn  could  be  chosen  such  that  a„  (recall  that  an  =  pnpn)  is  equal  to 
the  inverse  of  the  maximum  (in  case  of  concurrent  repairs)  expected  current  repair  time,  which  we 
denote  by  rn.  (For  a  single  repairman  and  exponential  repairs,  this  choice  is  equivalent  to  setting 
the  probability  of  a  failure  before  repair  to  0.5.)  This  heuristic  does  not  necessarily  lead  to  the 
most  variance  reduction,  however,  it  is  quite  effective  and  robust.  In  our  CSIM  implementation, 
we  were  not  able  to  determine  the  components  currently  under  repair;  therefore,  we  set  r„  equal 
to  the  maximum  expected  scheduled  repair  time.  We  did  not  anticipate  the  following  problem, 
however.  When  the  current  repair  time  is  much  larger  than  its  expectation,  the  biasing  level  (as 
determined  from  the  maximum  expected  scheduled  repair)  may  become  excessively  high,  causing 
untypical  failure  sequences.  This  tends  to  significantly  increase  the  variability  of  the  likelihood 
ratio,  leading  to  unstable  estimates.  We  overcome  this  problem  by  determining  the  biasing  level 
based  on  the  actual  maximum  scheduled  repair  whenever  it  exceeds  its  expectation  by  several  times 
(say,  5  times).  The  above  heuristics  have  been  shown  to  work  well,  as  will  be  demonstrated  in 
Section  4. 

At  every  uniformization  event,  the  ages  of  all  operational  components  are  adjusted  and  their 
hazaxd  rates  Eire  determined.  This  is  required  to  update  the  likelihood  ratio  depending  on  whether 
the  event  is  accepted  or  rejected  (see  Equation  5).  Since  repair  times  are  unchanged  in  the  simulated 
system,  no  updating  of  the  likelihood  ratio  is  necessary  at  repair  events. 

With  a  given  appropriate  biasing  level  ( PnPn ),  there  is  freedom  in  choosing  the  uniformization 
rate  Pn  (and  hence  the  acceptance  probability  p„).  Experiments  described  in  [24]  show  that  higher 
/?„  and  lower  pn  result  in  a  less  noisy  estimation  of  the  likelihood  ratio,  and  hence  somewhat 
lower  variance  of  the  resulting  estimate.  On  the  other  hand,  higher  pn  and  lower  p„  results  in 
an  excessive  number  of  rejected  (pseudo)  events,  i.e.,  very  inefficient  generation  of  failure  events. 
An  appropriate  uniformization  rate  should  be  chosen  low  enough  to  limit  excessive  generation  of 
pseudo  events,  yet  high  enough  to  preserve  accuracy.  Experiments  described  in  [24]  show  that 
P„  —  5 r„  is  a  good  choice  ( p„  is  then  determined  by  the  chosen  biasing  level). 
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Once  a  uniformization  event  is  accepted  as  a  failure,  then  one  of  the  operational  components 
ic  selected  as  the  failing  component.  In  our  experiments  in  Section  4  we  use  "balanced  failure 
biasing’’  (as  described  in  Section  3.1).  We  also  balance  the  first  component  failure  in  a  biased 
A-cycle.  This  is  quite  important,  particularly  for  "unbalanced’’  systems  (e.g.,  when  component 
reliabilities  are  of  different  orders  of  magnitude). 

4  Experimental  Results 

In  this  section  we  experiment  with  our  method  and  demonstrate  its  effectiveness  for  the  estimation 
of  steady-state  unavailability  in  highly  dependable  systems.  We  use  small  and  large  examples 
with  general  failure  and  repair  time  distributions.  In  special  cases,  some  examples  conform  to  the 
class  of  models  having  a  "product  form”  solution.  In  these  cases  the  results  are  invariant  with 
respect  to  failure  time  distributions  having  the  same  mean.  Therefore,  we  are  able  to  validate  with 
numerical  (non  simulation)  results  obtained  using  SAVE.  Assuming  exponential  failure  and  repair 
time  distributions,  models  not  having  a  product  form  solution  can  also  be  validated  using  SAVE. 

Some  of  our  design  choices  are  based  on  earlier  work  (e.g.,  in  [13],  [23]  and  [24]).  For  example, 
when  failure  biasing  is  activated,  we  set  the  biasing  level  /?„pn  equal  to  rn  (as  defined  in  Section  3.2). 
An  appropriate  uniformization  rate  /?„  should  be  chosen  low  enough  to  limit  excessive  generation 
of  pseudo  events,  yet  high  enough  to  preserve  accuracy.  Experiments  in  [24]  suggests  that  /?„  =  5r„ 
is  a  good  choice  (and  hence  the  acceptance  probability  (p„)  is  set  to  0.2). 

In  Section  4.1  a  small  machine  repairman  mode)  is  used  to  experiment  with  the  batch  size 
used  in  the  batch  means  method.  The  results  indicate  that  the  accuracy  of  the  estimates  (at  least 
in  this  example)  is  insensitive  to  the  batch  size.  In  Section  4.2  another  small  example  is  used 
to  study  the  efficiency  of  the  method  under  a  variety  of  circumstances.  Tins  is  accomplished  by 
varying  the  order  of  e  (used  to  parameterize  the  failure  hazard  rate  functions,  see  Section  3)  and 
by  experimenting  with  "balanced”  surd  ;'unbsdanced”  systems.  In  Section  4.3  a  large  example  is 
used  to  demonstrate  the  effectiveness  of  our  method  when  dealing  with  large  and  complex  systems. 
In  this  example  we  experiment  with  different  failure  time  distributions,  namely,  Erlang,  Weibull, 
exponentiad  and  hyperexponentisd.  With  exponential  repair  times  (at  the  same  rate),  FCFS  (first 
come  first  served)  repair  discipline  and  no  failure  propagation,  this  example  has  a  product  form 
solution.  In  this  case  we  can  validate  our  simulation  results  with  numerical  solutions  obtained 
using  SAVE. 

For  each  table  entry  in  all  experiments  of  this  section,  we  run  a  total  of  n  =  64000  original 
A-cycles,  which  are  used  to  estimate  the  expected  cycle  length  7.  For  each  original  A-cycle,  we  run 
m  =  4  biased  A-cycles.  It  follows  that  the  total  number  of  biased  A-cycles  used  to  estimate  the 
expected  downtime  6,  is  n'  =  256000.  For  al  but  the  experiment  in  Section  4.1,  we  fix  the  number  of 
batches  b  to  1000.  Accordingly,  the  batch  size  k  ( k ')  is  fixed  at  64  (256)  original  (biased)  A-cycles. 
In  each  table  entry  we  display  the  estimate  (from  simulation)  of  the  steady-state  unavailability, 
along  with  its  99%  half-width  confidence  interval  as  a  percentage  of  the  point  estimate. 

Except  for  special  cases,  the  models  considered  in  this  section  cannot  be  evaluated  either  an¬ 
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alytically  or  numerically.  Because  of  their  high  dependability  feature,  standard  (naive)  simulation 
is  also  not  practical.  A 3  a  result,  effectiveness  studies  i n  this  section  demonstrate  the  usefulness  of 
our  method,  as  it  considerably  extends  the  class  of  models  in  which  importance  sampling  cam  be 
used  to  evaluate  various  dependability  measures. 

4.1  Batch  Size 

As  described  in  Sections  2.3  and  3.2,  before  collecting  the  samples  Ci  (from  original  A-cycles)  and 
D,L{  (from  biased  A- cycles)  for  the  method  of  batch  means,  the  simulation  should  be  rim  long 
enough  to  reach  its  steady-state  dynamics.  This  can  be  accomplished  by  discarding  the  first  few 
batches  of  the  simulation.  Furthermore,  the  batch  size,  k ,  needs  to  be  sufficiently  large,  so  as  to 
(approximately)  eliminate  dependence  between  successive  batches.  In  this  section  we  use  a  small 
example  to  experiment  with  the  batch  size. 

We  consider  a  machine  repairman  model  with  two  component  types  and  two  components 
of  each  type.  The  system  is  considered  operational  as  long  as  one  component  of  each  type  is 
operational.  All  components  have  the  same  failure  time  distribution;  namely,  Erlang  with  two 
stages,  each  having  a  rate  equal  to  0.0002  per  hour.  Thus,  the  MTBF  of  individual  components  is 
10000  hours.  When  components  fail,  they  get  repaired  by  a  single  repairman  according  to  FCFS 
discipline.  For  all  components,  we  assume  that  the  repair  time  distribution  is  exponential  with  a 
mean  equal  to  1.0  hour.  With  exponential  repairs,  this  model  has  a  product  form  solution,  which 
depends  on  the  failure  distribution  only  through  its  mean.  Therefore,  we  can  validate  our  results 
by  solving  the  same  example,  with  exponential  failure  times  (having  the  same  MTBF,  i.e.,  10000 
hours).  Using  SAVE,  a  numerical  estimate  of  the  steady-state  unavailability  for  this  model  is  given 
by  4.0  x  10“8. 

For  the  same  total  number  n  =  64000  (»»'  =  256000)  of  original  (biased)  A- cycles,  in  Table 
1  we  successively  halve  the  number  of  batches  b  from  64000  to  250.  Accordingly,  the  batch  size 
is  successively  doubled  from  k  —  1  (k1  =  4)  to  k  =  256  (k1  =  1024).  Note  that  the  estimates  in 
the  table  compare  well  with  the  above  numerical  result  from  SAVE.  Observe  that  the  confidence 
interval  widths  do  not  depend,  in  any  significant  way,  on  the  batch  size.  (This  being  the  case 
also  for  the  smallest  batch  size  of  one  original  A- cycle.)  This  is  an  indication  that,  in  the  steady- 
state,  consecutive  A-cycles  axe  almost  uncorrelated.  While  it  seems  to  be  the  case  in  this  particular 
example,  this  is  not  generally  true  for  other  systems.  However,  additional  experiments  (not  reported 
here)  suggest  that  ‘‘near  independence’’  of  consecutive  A-cycles  (as  defined  in  Section  3.2)  may  be 
a  feature  of  highly  dependable  systems.  In  all  subsequent  experiments  we  set  the  batch  size  k  (k') 
to  64  (256);  this  is  large  enough  to  achieve  approximate  independence  between  successive  batches. 

4.2  A  Small  Example 

In  this  section  we  provide  empirical  results  illustrating  the  desirable  ‘‘bounded  relative  error” 
property  of  our  method.  We  show  that  as  system  failure  becomes  rarer,  we  can  still  estimate 
the  steady-state  unavailability  with  the  same  accuracy;  also,  when  the  system  is  ‘‘unbalanced".  In 
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Section  3  we  parameterized  the  failure  hazard  rate  functions  in  terms  of  e.  In  the  following  example 
vv c  parameterize  the  failure  time  distributions  m  terms  of  their  inverse  mean  (1/hITBF),  which  we 
denote  by  e.  By  varying  e  we  change  component  reliabilities  (i.e.,  their  MTBF),  arid,  hence,  the 
steady-state  system  unavailability.  Also,  by  having  different  component  types  with  different  e,  we 
create  examples  of  ‘‘unbalanced’’  systems. 

Again,  we  consider  a  machine  repairman  model  with  two  types  of  components;  3  components  of 
Type  I  and  2  components  of  Type  II.  The  system  is  considered  operational  as  long  as  one  component 
of  each  type  is  operational.  Failure  time  distributions  are  either  Erlang  or  hyperexponential  and 
may  be  different  for  each  component  type  (as  specified  below).  Type  II  components  have  a  higher 
(preemptive-resume)  priority  at  the  (single)  repair  facility.  The  repair  time  distribution  is  constant 
(deterministic),  at  1.0  hour,  for  Type  I  components  and  uniform,  between  0.0  and  1.0  hour,  for 
Type  II  components.  For  the  same  example,  we  perform  the  same  set  of  experiments  with  and 
without  failure  propagation.  If  failure  propagation  is  considered,  then  with  probability  0.25,  a 
failure  of  Type  II  component  causes  two  components  of  Type  I  to  fail  (all  operational  Type  I 
components  fail  if  they  are  equal  or  less  than  two). 

We  experiment  with  non-exponential  failure  time  distributions.  Specifically,  Erlang  (2,  A) 
(two  stages,  each  with  a  rate  A  per  hour)  and  Hyper  exponential  (p,  Aj,A2)  (two  stages,  with 
probabilities  p  and  1  —  p  at  rates  Ai  per  hour  and  A2  per  hour,  respectively).  We  parameterize 
these  distributions  in  terms  of  their  inverse  mean  (e)  as  follows: 

•  Ei(e)  d=  Erlang  (2,  2  e),  having  a  CV  (coefficient  of  variation)  =  0.707, 

•  ,ff2(e)  d=  Hyper  exponential  (0.2727,  0.3342  e,  4.01  e),  having  a  CV  =  2.0. 

We  simulate  the  system  using  our  method  with  the  batch  means  parameters  given  earlier  in 
Section  4.  For  two  values  of  e,  namely,  10-2  and  10~1,  in  Table  2  we  give  estimates  of  the  steady- 
state  unavailability  for  the  following  4  combinations  of  components’  failure  time  distributions: 

•  Cl:  iS2(e)  for  Type  I  and  E2(e)  for  Type  II 

•  C2:  Ei(e)  for  Type  I  and  F2(e15)  for  Type  II 

•  C3:  #2(0  for  Type  I  and  F2(e)  for  Type  II 

•  C4:  if2(e)  for  Type  I  and  JF2(e*  5)  for  Type  II. 

In  Cl  and  C3  the  system  is  ‘‘balanced”.  In  C"2  and  C4  the  system  is  “unbalanced”.  Notice  that 
the  relative  error  of  the  estimates  is  about  the  same  for  “balanced”  and  “unbalanced”  systems  and 
is  independent  of  the  value  of  e.  For  the  experiment  in  Cl,  we  ran  standard  simulation  with  the  same 
batch  means  parameters.  For  e  =  10-2,  it  produced  an  estimate  having  relative  error  ±13.41% 
(compared  with  ±5.26%  using  importance  sampling).  This  means  that  standard  simulation  should 
be  run  10  times  longer  to  achieve  about  the  same  accuracy  obtained  with  importance  sampling. 
No  failures  were  observed  for  e  =  10-4  using  standard  simulation. 

With  failure  propagation,  we  ran  the  same  experiments  using  our  method.  The  resulting 
estimates  are  given  in  Table  3.  Again,  the  accuracy  of  the  estimates  is  quite  consistent  throughout 
the  table. 
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4.3  A  Large  Example 

It  remains  to  show  that  the  method  described  here  is  also  feasible  and  effective  when  dealing  with 
large  and  complex  highly  dependable  systems.  In  this  section  we  consider  a  large  example  with 
many  types  of  components.  We  experiment  with  different  failure  time  distributions,  such  as  Erlang, 
Weibull,  exponential  and  hyperexponential.  Without  failure  propagation,  the  example  falls  within 
the  class  of  product  form  models  and  validation  with  numerical  (non  simulation)  results  is  possible. 

The  system  we  consider  is  based  on  a  model  of  a  fairly  complex  computing  system  (also 
considered  in  [24]).  The  computing  system  is  composed  of  two  sets  of  processors  with  2  processors 
per  set,  two  sets  of  controllers  with  2  controllers  per  set,  and  6  clusters  of  disks,  each  consisting 
of  4  disk  units.  In  a  disk  cluster,  data  is  replicated  so  that  one  disk  can  fail  without  affecting  the 
system.  The  ‘primary'’  data  on  a  disk  is  replicated  such  that  one  third  is  on  each  of  the  other 
three  disks  in  the  same  cluster.  Thus,  one  disk  in  each  cluster  can  be  down  without  losing  access 
to  the  data.  Components  are  repaired  by  a  single  repairman  according  to  a  FCFS  discipline.  The 
system  is  defined  to  be  operational  if  all  data  is  accessible  to  both  processor  types,  which  means 
that  at  least  one  processor  of  each  type,  one  controller  in  each  set,  and  3  out  of  4  disk  units  in 
each  of  the  6  disk  clusters  are  operational.  Operational  components  continue  to  fail  at  the  given 
rates  when  the  system  is  failed.  When  failure  propagation  is  considered,  a  failing  processor  in  any 
of  the  two  sets  causes  one  processor  in  the  other  set  to  fail  with  probability  0.1. 

All  repair  time  distributions  axe  exponential  with  mean  1  hour  (however,  any  general  distri¬ 
bution  could  be  allowed).  All  of  the  component  failure  times  follow  the  same  distribution,  with 
the  same  coefficient  of  variation  (CV),  but  possibly  with  different  means  (for  the  different  types  of 
components).  The  MTBF  for  processors,  controllers  and  disks  are  assumed  to  be  200000,  200000 
and  600000  hours,  respectively.  We  experiment  with  four  failure  time  distributions;  namely,  Erlang 
with  2  stages  (CV  =  0.707),  Weibull  with  a  shape  parameter  equal  to  1.25  (CV  =  0.805),  exponen¬ 
tial  (CV  =  1.0),  and  hyperexponential  with  2  stages  (CV  =  2.0).  For  the  Weibull  distribution,  the 
scale  parameters  corresponding  to  the  overall  means  200000  and  600000  are  equal  to  2.1634  x  10-T 
and  5.4793  x  10~8,  respectively.  For  the  hyperexponential  distribution,  the  parameters  are  as  fol¬ 
lows:  a  probability  equals  to  0.2727  of  branching  to  the  first  stage  with  a  mean  600000  (1800000) 
and  a  probability  equals  to  0.7273  of  branching  to  the  second  stage  with  a  mean  50000  (150000), 
corresponding  to  the  overall  mean  200000  (600000). 

Notice  that  uniformization  cannot  be  used  to  sample  from  a  Weibull  distribution,  since  its 
hazard  rate  function  is  not  bounded.  However,  as  we  do  in  this  example,  random  variates  from 
an  IFR  (increasing  failure  rate)  Weibull  can  be  arbitrarily  well  approximated  by  sampling  (using 
uniformization)  from  another  distribution  having  a  bounded  hazard  rate  function.  This  approx¬ 
imation  is  obtained  by  simply  bounding  the  hazard  rate  function  h(t)  of  the  (IFR)  Weibull  at 
Am  =  h(tm)  beyond  a  sufficiently  large  time  tm,  such  that  G(tm)  is  extremely  small,  say,  10~2°. 
If  Am  is  not  too  high  compared  to  other  hazard  rates  in  the  system,  then  a  reasonably  efficient 
uniformization  rate  can  be  used  to  generate  failure  events. 

We  simulate  the  described  system  using  our  method  (with  the  batch  means  parameters  given 
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in  Section  4)  to  get  estimates  of  the  steady-state  unavailability  for  two  sets  of  failure  distributions. 
In  the  first  oct  (Set  I)  we  use  the  cuuipoiicnts  nITCFs  given  above.  In  the  second  set  { S e t  II)  we 
reduce  all  components’  MTBFs  by  a  factor  of  10  (i.e.,  we  use  less  reliable  components).  Accordingly, 
all  means  in  the  Erlang,  exponential  and  hyperexponential  stages  axe  also  reduced  by  a  factor  of 
10.  For  the  Weibull  distribution,  the  scale  parameters  corresponding  to  the  overall  means  20000 
and  60000  are  equal  to  3.847  x  10-b  and  9.744  x  10— 7 ,  respectively. 

Without  failure  propagation,  the  above  example  has  a  product  form  solution,  which  depends 
on  the  failure  time  distributions  only  through  their  means.  It  follows  that  the  steady-state  un¬ 
availability  is  the  same  for  all  failure  time  distributions  having  the  same  mean.  Furthermore,  using 
SAVE,  we  can  obtain  numerical  estimates,  which  are  given  by  4.0  x  10-10  and  4.0  x  10-8  for  failure 
data  sets,  I  and  II,  respectively. 

In  Table  4  we  give  estimates  of  the  steady-state  unavailability  for  the  system  without  failure 
propagation,  for  both  sets  of  failure  data,  I  and  II.  All  relative  errors  in  this  table  are  less  than 
±10%.  Notice  the  agreement  among  the  estimates  for  different  failure  distributions,  on  one  hand, 
and  with  the  above  results  from  SAVE,  on  the  other  hand.  For  the  hyperexponential  failure  dis¬ 
tribution,  the  estimates  are  slightly  less  accurate  than  those  corresponding  to  failure  distributions 
with  a  lower  coefficient  of  variation.  For  this  (hyperexponential)  case,  standard  simulation  (with 
the  same  total  number  of  cycles)  produced  unstable  estimates  having  relative  errors  of  ±260% 
and  ±111%  for  failure  data  Sets  I  and  II,  respectively.  In  fact,  these  confidence  intervals  are 
meaningless,  since  they  contain  negative  values. 

In  Table  5  we  give  estimates  of  the  steady-state  unavailability  for  the  system  with  failure 
propagation,  for  both  sets  of  failure  data,  I  and  II.  A  preemptive-resume  discipline  at  the  repair 
facility  is  now  assumed,  with  processors  having  the  highest  priority  and  disks  having  the  lowest 
priority.  Notice  that  the  steady-state  unavailability  is  affected  only  a  little  by  the  failure  time 
distribution.  However,  as  expected,  the  point  estimates  are  consistently  higher  than  those  in  Table 
4.  Because  this  is  a  different  system,  the  confidence  intervals  happen  to  be  slightly  wider  than 
those  in  Table  4. 

Again,  in  each  of  the  Tables  4  and  5,  the  accuracies  of  the  estimates  are  about  the  same, 
regardless  of  the  failure  distributions  or  their  means.  These  empirical  results  are  consistent  with 
the  conjecture  that  our  method  produces  estimates  of  steady-state  unavailability  having  bounded 
relative  error. 

5  Conclusions 

This  paper  has  considered  the  problem  of  estimating  steady-state  quantities  for  models  of  highly 
dependable  computing  systems  in  which  the  component  failure  and  repair  times  have  general  distri¬ 
butions.  Such  models  are  analytically  and  numerically  intractable;  simulation  is  the  only  possible 
means  of  analysis.  However,  standard  simulation  is  inefficient  when  system  failure  events  are 
rare  and  importance  sampling  needs  to  be  used  to  speedup  the  simulation.  Earlier  importance 
sampling  approaches  are  effective  for  steady-state  estimation  only  when  the  failure  time  distribu- 
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tions  are  exponentially  distributed,  in  which  case  the  regenerative  structure  of  the  model  can  be 
exploited.  When  failure  times  are  generally  distributed,  no  such  iegeneiationa  cXi&t.  However, 
a  ratio  representation  in  terms  of  non-i.i.d.  cycles  still  exists  for  steady-state  quantities.  Using 
this  representation,  a  splitting  technique  can  be  devised  in  which  importance  sampling  is  used  to 
estimate  the  expected  downtime  during  a  cycle  and  standard  simulation  is  used  to  estimate  the 
expected  cycle  length.  The  particular  method  of  importance  sampling  that  we  use  is  based  on  uni- 
formization,  and  is  provably  effective  for  estimating  certain  transient  quantities  within  this  class  of 
models.  Experiments  showed  the  method  to  be  effective  in  practice  for  estimating  the  steady-state 
unavailability. 

As  a  result  of  this  work,  the  class  of  highly  dependable  systems  that  can  be  efficiently  simulated 
to  estimate  steady-state  measures  is  greatly  broadened. 
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Table  1:  Estimates  of  steady-state  unavailability  (xlO8)  in  a  machine  repairman  model 
(experiments  with  batch  size). 
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Table  2:  Estimates  of  steady-state  unavailability  in  a  machine  repairman  model 
without  failure  propagation  . 
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Table  3:  Estimates  of  steady-state  unavailability  in  a  machine  repairman  model 
with  failure  propagation. 
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Table  4:  Estimates  of  steady-state  unavailability  in  a  large  example  (without  failure  propagation). 
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Table  5:  Estimates  of  steady-state  unavailability  in  a  large  example  (with  failure  propagation). 


